Physiologically motivated audio-visual localisation and tracking
Abstract
An audio-visual localisation and tracking system for meeting scenarios is presented that draws its inspiration from neurobiological processing. Meetings are recorded by a KEMAR binaural manikin and a single camera placed directly above the manikin. Source locations estimated from the binaural audio, together with face, object and motion locations from the video frames, are used as input to two linked neural oscillator networks. The strength of the connections between the two networks determines the mapping between activity at a particular audio azimuth and activity at a particular visual frame column. A Hebbian learning rule is used to establish the connection strengths. The combined network segments the video and audio features and then produces audio-visual groupings on the basis of common spatial location. The audio-visual groupings are tracked through time using a mechanism modelled on the human oculomotor system, incorporating both smooth pursuit and saccadic movement.
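The abstract does not give the form of the Hebbian rule, so the following is only a minimal sketch of how co-occurring activity could establish an azimuth-to-column mapping. It assumes a plain outer-product update (weight change proportional to the product of audio and visual activity), one-hot activity patterns, and a made-up ground truth in which azimuth bin `a` coincides with frame column `2 * a`. The function names, bin counts, learning rate and jitter model are all hypothetical and are not taken from the paper.

```python
import random


def hebbian_update(weights, audio, visual, rate=0.1):
    """Strengthen weights[a][v] in proportion to co-active audio and visual units.

    Assumed rule (not from the paper): delta_w = rate * audio[a] * visual[v].
    """
    for a, xa in enumerate(audio):
        if xa == 0.0:
            continue  # inactive audio units leave their connections unchanged
        row = weights[a]
        for v, xv in enumerate(visual):
            row[v] += rate * xa * xv


def one_hot(index, size):
    """Activity pattern with a single active unit."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def train_mapping(n_azimuths=9, n_columns=18, n_events=900, seed=0):
    """Learn connection strengths from simulated audio-visual co-occurrences."""
    rng = random.Random(seed)
    weights = [[0.0] * n_columns for _ in range(n_azimuths)]
    for _ in range(n_events):
        a = rng.randrange(n_azimuths)
        # Hypothetical ground truth: azimuth bin a lines up with column 2 * a,
        # with an occasional one-column localisation error.
        jitter = rng.choice([0, 0, 0, 0, 0, 0, -1, 1])
        v = min(n_columns - 1, max(0, 2 * a + jitter))
        hebbian_update(weights, one_hot(a, n_azimuths), one_hot(v, n_columns))
    return weights


def preferred_column(weights, azimuth):
    """Frame column with the strongest connection to the given azimuth bin."""
    row = weights[azimuth]
    return max(range(len(row)), key=row.__getitem__)


weights = train_mapping()
mapping = [preferred_column(weights, a) for a in range(len(weights))]
```

After training, `mapping[a]` is the column most strongly connected to azimuth bin `a`. A practical system would probably also need some form of weight normalisation or decay, since a pure outer-product rule lets weights grow without bound.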
Related resources
Active Speaker Localisation and Tracking using Audio and Video
This thesis is concerned with the problem of tracking active speakers using audio and video data. Particular focus is placed on the task of tracking the current active speaker in a lecture room environment using multiple cameras and multiple microphones. A database of lecture recordings corresponding to this scenario from the European Integrated Project, Computers in the Human Interaction Loop ...
HMM-based audio-visual speech recognition integrating geometric and appearance-based visual features
A good front end for visual feature extraction is an important element of audio-visual speech recognition systems. We propose a new visual feature representation that combines both geometric and pixel-based features. Using our previously developed contour-based lip-tracking algorithm, geometric features including the height and width of the lips are automatically extracted. Lip boundary tracking...
Audio-visual Multiple Active Speaker Localisation in Reverberant Environments
Localisation of multiple active speakers in natural environments with only two microphones is a challenging problem. Reverberation degrades the performance of speaker localisation based exclusively on directional cues. This paper presents an approach based on audio-visual fusion. The audio modality performs the multiple speaker localisation using the Skeleton method, energy weighting, and prece...
The Ta2 Database - a Multi-modal Database from Home Entertainment
This paper presents a new database containing high-definition audio and video recordings in a rather unconstrained video-conferencing-like environment. The database consists of recordings of people sitting around a table in two separate rooms communicating and playing online games with each other. Extensive annotation of head positions, voice activity and word transcription has been performed on...